Abstract. We present the methodology and recognition performance
characteristics used in the Face Recognition Vendor Test 2002. We refine
the notion of a biometric imposter and show that the traditional measures
of identification and verification performance are limiting cases
of the open-universe watch list task. The watch list problem generalizes
the tradeoff of detection and identification of persons of interest
against a false alarm rate. In addition we use performance scores on
disjoint populations to establish a means of computing and displaying
distribution-free estimates of the variation of verification vs. false alarm
performance. Finally we formalize gallery normalization, which is an
extension of previous evaluation methodologies: we define a pair of gallery
dependent mappings that can be applied as a post recognition step to
vectors of distance or similarity scores. All the methods are biometric
non-specific and applicable to large populations.


1 Introduction 

The evaluation protocol [2] used in FRVT 2002 [4] is a framework for the
quantitative determination of the performance of recognition technologies using
arbitrary biometrics. Specifically, it applies to either online (live human
subjects) or offline (stored imagery) testing as long as it produces output
recognition data that is available for subsequent analysis. This has the
advantage that tests can be conducted uniformly and fairly across participants,
the results are repeatable, and very large image sets can be tested
expeditiously. We first introduce the terminology of the FRVT 2002 protocol,
then present normalization in section 1.1, and define our performance metrics
in section 2.

In the FRVT 2002 protocol a system is required to compare two biometric
signatures and report a scalar similarity score. A biometric signature can be an
arbitrary ensemble of pieces of imagery from an individual, for example a face
and a fingerprint, or a face and a voice audio sequence. In FRVT 2002 both single
face stills and video sequences were used. We use the term image to include
such modalities, unless stated otherwise. A similarity score is a measure of the
sameness of identity of the individuals appearing in the images. Without loss of
generality the protocol requires that images of the same person have a large
similarity score. Systems reporting distance measures, where a small value
indicates sameness of identity, have their values negated before any processing.

The FRVT 2002 tests are structured around sets of images. An algorithm
compares all images in a query set, $\mathcal{Q}$, with all images in a target set,
$\mathcal{T}$. The result is a similarity matrix whose $ij$-th element is the similarity
between the $i$-th element of $\mathcal{T}$ and the $j$-th element of $\mathcal{Q}$. The matrix is
computed and stored in column order; each column corresponds to an unknown query
image being compared with all the known, enrolled, target images. In the general
case the matrix is rectangular, but for FRVT 2002 we used the special case
$\mathcal{T} = \mathcal{Q}$.

This framework allows for great flexibility in arriving at quantitative results.
The sets $\mathcal{T}$ and $\mathcal{Q}$ are not normally used directly. Instead we consider
virtual image sets, a conceptual innovation first defined in the FERET
protocol [3]. Here a gallery, $\mathcal{G}$, a subset of $\mathcal{T}$, contains exactly one
signature per subject and represents the set of images that have been enrolled
in a biometric system. Likewise a probe set, $\mathcal{P}_G$, is a subset of $\mathcal{Q}$. Each
of its images has a match in the gallery, and represents a legitimate user. The
images of a third set, the imposter set $\mathcal{P}_N$, also a subset of $\mathcal{Q}$, do not
have a match in the gallery. This set represents persons attempting to defeat a
system. A match describes the comparison of probe and gallery images of the same
individual. A non-match likewise arises from images of different persons.

The utility of this framework is that many different recognition experiments
are embedded within $\mathcal{T}$ and $\mathcal{Q}$. All the results of FRVT 2002 are obtained
from the similarity elements corresponding to the rows defined by the subset
$\mathcal{G}$, and the columns defined by $\mathcal{P}_G$ and $\mathcal{P}_N$. Together these form a
similarity matrix $S$ from which various performance statistics are computed.
Disjoint gallery and probe sets allow performance to be estimated on particular
recognition tasks. For example, to explore the effect of subject ageing, a
gallery containing the earliest image of all persons is used with a probe set
of more recent images.
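
As an illustration of how virtual image sets select a sub-matrix of the full
target-by-query similarity matrix, the following minimal sketch uses NumPy
index arrays. The array names, the matrix size and the index lists are all
hypothetical; they are not part of the FRVT 2002 protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical full similarity matrix: rows index target images,
# columns index query images.
full = rng.normal(size=(6, 6))

# Hypothetical virtual sets, given as index lists into the target and query sets.
gallery_rows = np.array([0, 2, 4])   # one enrolled image per subject
probe_cols = np.array([1, 3])        # queries that have a match in the gallery
imposter_cols = np.array([5])        # queries with no match in the gallery

# Gallery rows against matching probes, and gallery rows against imposters.
S_probes = full[np.ix_(gallery_rows, probe_cols)]
S_imposters = full[np.ix_(gallery_rows, imposter_cols)]

print(S_probes.shape, S_imposters.shape)
```

Because the sub-matrices are only views selected by index, many experiments
can be scored from one stored matrix without rerunning the algorithm.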

This protocol evaluates recognition technologies rather than deployed systems.
In particular it ignores efficiency and performance when databases are
partitioned. See Wayman [5] for a treatment of these issues.

1.1  Normalization 

The FRVT 2002 protocol allows a normalization option. This is a post-processing
transform of similarity scores that may exploit the fact that the gallery, unlike
the target set, contains only one image per person by definition. Specifically, a
vector, $s$, contains the column of elements $s_{ip}$ from $S$ corresponding to the
single probe $p$ against gallery $\mathcal{G}$. Normalization is defined as a function,
$f : \mathbb{R}^N \to \mathbb{R}^N$, mapping $s$ to a new vector, $t$. For an algorithm that uses
normalization, the final performance scores are computed over these transformed
values. Notably the normalization option is operationally realistic only if each
probe is processed independently of all others. Certain unjustified performance
gains may be realized if normalization were allowed across probes.

Two classes of normalization functions were allowed in FRVT 2002. The
first is simply $t = f_1(s)$. The second also takes a matrix of similarities,
$S_{GG}$, between all gallery pairs, whose off-diagonal elements contain
information available to an operational system. The second form of
normalization, then, is simply $t = f_2(s, S_{GG})$. Notably, the use of this second
form is impractical for even moderate gallery sizes because $S_{GG}$ imparts
$O(N^2)$ resource constraints. FRVT 2002 participants were given the option of
submitting different $f_1$ and $f_2$ functions for each of the recognition tasks:
identification, verification and watch list, discussed in the next section. The
functions were supplied to, and run by, NIST as functions in an object file.

It should be noted that the use of normalization causes all searches to become
one-to-many operations. The verification task, detailed below, is usually
considered a one-to-one search and is correspondingly efficient, but if
normalization is to be used then a gallery, of some size and content, must be
constructed. If the gallery is changed, normalization must be recomputed.
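
To make the first form of normalization concrete, here is a minimal sketch of
one possible $f_1$. The choice of a z-score transform is an assumption made
purely for illustration; the functions actually submitted by FRVT 2002
participants are not described here. The helper names `f1_znorm` and
`normalize_per_probe` are hypothetical.

```python
import numpy as np

def f1_znorm(s):
    """Hypothetical f_1: map one probe's column of gallery scores to z-scores."""
    s = np.asarray(s, dtype=float)
    spread = s.std()
    if spread == 0.0:
        return np.zeros_like(s)
    return (s - s.mean()) / spread

def normalize_per_probe(S, f):
    """Apply f to each column of the gallery-by-probe matrix S independently,
    so that no information is shared across probes."""
    return np.column_stack([f(S[:, j]) for j in range(S.shape[1])])

# Rows are gallery entries, columns are probes (illustrative values only).
S = np.array([[0.9, 0.2],
              [0.1, 0.3],
              [0.2, 0.4]])
T = normalize_per_probe(S, f1_znorm)
print(T)
```

A transform of this kind is increasing within each column, so it leaves the
rank of the match unchanged; what it can change is which scores clear a fixed
threshold, which is why it matters for verification and watch list tasks.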

2 Performance  Metrics 

In FRVT 2002, we evaluate an algorithm on three related tasks: identification,
verification, and watch list, and we state separate, appropriate statistics for
each. As described above, performance on each of these tasks is obtained solely
from the similarity values extracted from the similarity matrix and from the
subject identities. In a proctored test such as FRVT 2002, identity information
is not available to the recognition systems.

The watch list problem is a generalization of both identification and
verification. For this task, a probe $p$ is compared to a gallery which we term
the watch list. The probe ranks the gallery, so we state performance as an
identification rate. However, a significant operational constraint is that a
system should not attempt identification of individuals not on the watch list.
We must also, therefore, measure a false alarm rate. This makes clear that the
generalized watch list problem is defined over an open universe.

In  the  next  three  subsections  we  formalize  the  watch  list  problem  and  show 
that  verification  is  the  special  case  when  the  watch  list  size  is  1,  and  identification 
conversely  is  a closed  universe  watch  list  task. 


2.1  Watch  List 

We measure watch list performance using a watch list $\mathcal{G}$ and two probe
sets: $\mathcal{P}_G$ with subjects who should be identified, and $\mathcal{P}_N$ with true
imposters who should not throw an alarm. The former is used to state the
detection and identification rate as the fraction of probes in $\mathcal{P}_G$ that are
detected at or above threshold $t$ and recognized at rank $r$ or better:

$$P_{DI}(t, r) = \frac{|\{p_j : \mathrm{rank}(p_j) \le r,\ s_{ij} \ge t,\ \mathrm{id}(p_j) = \mathrm{id}(g_i)\}|}{|\mathcal{P}_G|} \quad \forall p_j \in \mathcal{P}_G \tag{1}$$

where the rank is defined as the number of watch list entries whose similarity
to the probe is greater than or equal to that of the matching entry:

$$\mathrm{rank}(p_j) = |\{g_k : s_{kj} \ge s_{ij},\ \mathrm{id}(g_i) = \mathrm{id}(p_j)\}| \quad \forall g_k \in \mathcal{G} \tag{2}$$

Throughout  we  use  i and  k to  subscript  gallery  elements,  corresponding  to  rows 
of  the  similarity  matrix,  and  j to  subscript  the  probe  sets  corresponding  to 
columns  of  the  matrix.  In  practice,  this  needs  to  be  modified  to  handle  tied 
similarity  values:  we  elected  to  use  the  mean  of  the  lower  and  upper  ranks  of  the 
run  of  tied  values: 

$$2\,\mathrm{rank}(p_j) = |\{k : s_{kj} > s_{ij},\ \mathrm{id}(g_i) = \mathrm{id}(p_j)\}| + |\{k : s_{kj} \ge s_{ij},\ \mathrm{id}(g_i) = \mathrm{id}(p_j)\}| + 1 \tag{3}$$

The imposter set is used to compute the false alarm rate as the fraction of
probes from $\mathcal{P}_N$ whose similarity to any gallery image is at or above
threshold:

$$P_{FA}(t) = \frac{|\{p_j : \max_i s_{ij} \ge t\}|}{|\mathcal{P}_N|} \quad \forall p_j \in \mathcal{P}_N,\ \forall g_i \in \mathcal{G} \tag{4}$$
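
The two watch list rates can be sketched directly from their definitions. The
code below is an illustrative reading of equations (1), (2) and (4), not the
FRVT 2002 scoring software. It assumes a gallery-by-probe similarity matrix
whose column $j$ has its match in row `match_row[j]`; the function names and
the toy scores are hypothetical, and ties are handled as in equation (2) rather
than with the averaged rank of equation (3).

```python
import numpy as np

def match_scores(S, match_row):
    """Similarity of each probe (column j) to its own gallery entry."""
    return S[match_row, np.arange(S.shape[1])]

def rank_of_match(S, match_row):
    """Equation (2): number of gallery entries scoring at or above the match."""
    return (S >= match_scores(S, match_row)).sum(axis=0)

def detect_and_identify_rate(S, match_row, t, r):
    """Equation (1): match is at or above threshold t and at rank r or better."""
    m = match_scores(S, match_row)
    hits = (rank_of_match(S, match_row) <= r) & (m >= t)
    return hits.mean()

def false_alarm_rate(S_imp, t):
    """Equation (4): imposters whose best gallery score is at or above t."""
    return (S_imp.max(axis=0) >= t).mean()

# Toy watch list of three people; three legitimate probes and two imposters.
S_G = np.array([[0.9, 0.3, 0.2],
                [0.1, 0.8, 0.6],
                [0.2, 0.1, 0.5]])
match_row = np.array([0, 1, 2])
S_N = np.array([[0.3, 0.75],
                [0.2, 0.10],
                [0.1, 0.40]])

p_di = detect_and_identify_rate(S_G, match_row, t=0.7, r=1)
p_fa = false_alarm_rate(S_N, t=0.7)
print(p_di, p_fa)
```

With these toy scores two of the three probes are detected and identified at
rank one, and one of the two imposters raises an alarm.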


2.2  Identification 


Identification is a special case of the watch list task: if we mandate a closed
universe, then the false alarm rate is undefined and a pure identification rate
specifies performance. Formally, for each probe $p$ from $\mathcal{P}_G$ we sort the
similarity scores against gallery $\mathcal{G}$, and obtain the rank of the match.
Identification performance is then stated as the fraction of probes whose
gallery match is at rank $r$ or lower. Let the set of probes with a close match
be

$$C(r) = \{p_j : \mathrm{rank}(p_j) \le r\} \quad \forall p_j \in \mathcal{P}_G \tag{5}$$

where the rank is defined as before. We now define the Cumulative Match
Characteristic (CMC) to be the identification rate as a function of $r$:

$$P_I(r) = \frac{|C(r)|}{|\mathcal{P}_G|} \tag{6}$$


which we plot as the primary measure of identification performance. It gives an
estimate of the rate at which probe images will be classified at rank $r$ or
better. It is a non-decreasing function of $r$. Although the CMC is most often
summarized with rank one performance, other points and the steepness of the
curve are also relevant operationally. One drawback of the characteristic is its
dependence on gallery size, $|\mathcal{G}|$. For this reason we also plot identification
performance at fixed rank as a function of the gallery size.
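
A CMC can be sketched from the rank definition in a few lines. The code below
is an illustration under stated assumptions rather than the scoring code used
in FRVT 2002: it assumes a gallery-by-probe matrix whose column $j$ has its
match in row `match_row[j]`, uses the averaged rank for ties from section 2.1,
and the function names and toy scores are hypothetical.

```python
import numpy as np

def midranks(S, match_row):
    """Mean of the lower and upper ranks over any run of tied scores."""
    m = S[match_row, np.arange(S.shape[1])]
    greater = (S > m).sum(axis=0)
    greater_equal = (S >= m).sum(axis=0)   # this count includes the match itself
    return (greater + greater_equal + 1) / 2.0

def cmc(S, match_row):
    """Identification rate P_I(r) for r = 1 .. gallery size."""
    rk = midranks(S, match_row)
    r_values = np.arange(1, S.shape[0] + 1)
    return np.array([(rk <= r).mean() for r in r_values])

S = np.array([[0.9, 0.3, 0.2],
              [0.1, 0.8, 0.6],
              [0.2, 0.1, 0.5]])
match_row = np.array([0, 1, 2])
curve = cmc(S, match_row)
print(curve)

# A probe whose match ties with one other gallery entry gets rank 1.5.
tied = midranks(np.array([[0.5], [0.5], [0.1]]), np.array([0]))
print(tied)
```

The curve is non-decreasing in $r$ by construction, because the set of probes
counted at rank $r$ is contained in the set counted at rank $r + 1$.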


2.3  Verification 

The use of biometric systems for the verification task is perhaps more common
than identification. The operational model assumes that a probe $p$ is compared
with its matching gallery image and that the single similarity score is compared
against a threshold to verify the individual or otherwise. Two types of error
can occur in this process: first, a false accept, in which an imposter claims an
identity and is matched by the system above threshold; and second, a false
reject, in which the system incorrectly scores a legitimate individual below
threshold. This method for accepting or rejecting a claim models the
Neyman-Pearson observer. A Neyman-Pearson model maximizes the verification rate
for a constant false accept rate [1].

Fig. 1. Identification rate as a function of rank and false alarm rate for a watch list
size of 3000. Note the weak dependence on rank, except at high false accept rates. The
horizontal plane gives the rank one closed-universe identification rate.

The Receiver Operating Characteristic (ROC) is computed to quantify
verification performance. It shows the tradeoff between the two types of error
by plotting estimates of the verification rate, $P_V$ (i.e. the true accept
rate), against the false accept rate, $P_{FA}$, as a parametric function of the
prior operating threshold, $t$. The verification rate is the fraction of probes
whose gallery match has similarity greater than or equal to the threshold
value, $t$:

$$P_V(t) = \frac{|\{p_j : s_{ij} \ge t,\ \mathrm{id}(g_i) = \mathrm{id}(p_j)\}|}{|\mathcal{P}_G|} \quad \forall p_j \in \mathcal{P}_G \tag{7}$$

The false accept rate is computed over the set of imposters, $\mathcal{P}_N$, containing
those individuals who are not present in the gallery:

$$P_{FA}(t) = \frac{|\{s_{ij} : s_{ij} \ge t\}|}{|\mathcal{P}_N|\,|\mathcal{G}|} \quad \forall p_j \in \mathcal{P}_N,\ \forall g_i \in \mathcal{G} \tag{8}$$

where the denominator shows our use of all the non-matching similarities in
order to improve the fidelity of our $P_{FA}$ estimate. In reality an imposter would
usually claim just one identity, but our method of using all scores is realistic
unless an imposter has a known prior resemblance to a person.

Fig. 2. Watch list performance. The rank one detection and identification rate is plotted
as a function of gallery size and false alarm rate.

Note that the watch list false alarm rate of equation (4) differs from the
verification false accept rate. For the watch list we determine whether there
exists at least one gallery image more similar to the imposter than $t$. This
occurs more frequently than in verification, where the number of scores above
$t$ between an imposter and all gallery images is counted.

The use of a true imposter set $\mathcal{P}_N$ contrasts with "round-robin" evaluations,
in which $\mathcal{P}_G$ and $\mathcal{P}_N$ are the same set. From an operational standpoint,
this models the case where a subject, already with legitimate access to the
system (they are in $\mathcal{P}_G$), attempts to gain access to the very same system,
under a different identity. There may be some specialized scenarios where this
is a valid model. However, we prefer to model the situation in which a person
who does not already have access to the system attempts verification. In this
model, the persons (not just the images) in $\mathcal{P}_N$ are different from those in
the gallery. The rationale for using true imposters is that the non-match
distributions estimated from $S_{GP_N}$ and $S_{GP_G}$ may be different.

An empirical ROC is not a curve but a sequence of steps corresponding to
a set of operating points where $P_V$ and/or $P_{FA}$ changes. Because systems are
operated on the ROC's convex hull, which corresponds to the points where $P_V$
changes, we use the $|\mathcal{P}_G|$ match values as thresholds. We sort these values,
keep their unique subset as $t_i$, $i = 1 \ldots |\mathcal{P}_G|$, and insert the artificial
threshold $t_0 = -\infty$. This avoids the use of the much larger number of
non-matches, $|\mathcal{G}||\mathcal{P}_N|$, in the ROC calculation. The computation of $P_V$
proceeds by binning the number of match values from $S_{GP_G}$ that are in the
range $[t_i, t_{i+1})$, and finally by accumulating all of them. The same thresholds
and procedure are used for $P_{FA}$ from $S_{GP_N}$.
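
The threshold selection described above can be sketched as follows. This is an
illustration, not the FRVT 2002 implementation: instead of explicit binning and
accumulation it uses `numpy.searchsorted` on sorted scores, which yields the
same counts of scores at or above each threshold. The function name and the toy
scores are hypothetical.

```python
import numpy as np

def roc_from_matches(match_scores, nonmatch_scores):
    """Evaluate P_V and P_FA at the unique match values plus a -inf threshold."""
    match_scores = np.sort(np.asarray(match_scores, dtype=float))
    nonmatch_scores = np.sort(np.asarray(nonmatch_scores, dtype=float))
    thresholds = np.concatenate(([-np.inf], np.unique(match_scores)))
    # searchsorted(side="left") counts the scores strictly below each threshold,
    # so the remainder is the number of scores at or above it.
    below_match = np.searchsorted(match_scores, thresholds, side="left")
    below_non = np.searchsorted(nonmatch_scores, thresholds, side="left")
    p_v = 1.0 - below_match / match_scores.size
    p_fa = 1.0 - below_non / nonmatch_scores.size
    return thresholds, p_v, p_fa

match = [0.9, 0.8, 0.5]
nonmatch = [0.1, 0.2, 0.3, 0.6, 0.2, 0.1]
thresholds, p_v, p_fa = roc_from_matches(match, nonmatch)
for t, v, fa in zip(thresholds, p_v, p_fa):
    print(t, v, fa)
```

Only as many operating points as there are distinct match scores are
evaluated, even though the non-match scores are far more numerous in a real
test.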

2.4  Limiting  Cases 

Figures 1 and 2 show watch list performance as functions of watch list size,
false alarm rate and rank. They were generated from the similarity scores of a
leading FRVT 2002 participant. For galleries of size 1 the watch list measures
reduce to verification: the intersection of the surface and the $(1, y, z)$ plane
is the ROC curve. The CMC curve is the plane $(x, 1, z)$ of figure 1 and similarly
identification performance as a function of gallery size is given by the
intersection of the surface and the $(x, 1, z)$ plane, as shown in figure 2. These
two cases correspond to the relaxation of the threshold, $t \to -\infty$, whence
$P_{FA} \to 1$; the detection and identification rate then becomes the
identification rate, i.e. $P_{DI}(-\infty, r) = P_I(r)$.
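
Both limiting cases can be checked against the definitions. The following is a
sketch of the reasoning, restating the watch list rates of section 2.1 and the
rates of sections 2.2 and 2.3, and assuming, as the discussion of the
verification false accept rate indicates, that it counts all non-match scores
between imposters and gallery images:

```latex
% Closed universe: as t -> -infinity every score passes the threshold test,
% so only the rank condition remains.
\lim_{t \to -\infty} P_{DI}(t, r)
  = \frac{|\{p_j : \mathrm{rank}(p_j) \le r\}|}{|\mathcal{P}_G|}
  = \frac{|C(r)|}{|\mathcal{P}_G|}
  = P_I(r)

% Watch list of size one, G = {g_1}: every probe has rank 1,
% and max_i s_{ij} = s_{1j}.
P_{DI}(t, 1) = \frac{|\{p_j : s_{1j} \ge t\}|}{|\mathcal{P}_G|} = P_V(t),
\qquad
P_{FA}(t) = \frac{|\{p_j : s_{1j} \ge t\}|}{|\mathcal{P}_N|}
```

In the second case the watch list false alarm rate and the verification false
accept rate coincide, because each imposter contributes exactly one non-match
score.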

2.5  Comparing  Verification  Performances 

Frequently  we  seek  to  compare  verification  results,  either  between  systems,  or 
across  different  data  sets  processed  by  the  same  system.  Between  systems  this 
may  be  accomplished  by  looking  at  the  verification  rate  at  fixed  false  accept 
rates.  This  is  appropriate  because  it  is  not  possible,  nor  operationally  feasible,  to 
set  a uniform  threshold  across  different  systems  that  report  on  different  ranges 
and scales. However, for studies using just one system and many gallery and
probe sets, fixed $P_{FA}$ values correspond to different thresholds, which are not
operationally realizable. The correct approach is to set a single threshold $t$
and acknowledge that $P_V(t)$ and $P_{FA}(t)$ are random variables across
experiments.

Thus, to be able to compute variation in verification and false accept rates
across multiple galleries and probe sets, we must compute the ROC using a fixed
global set of thresholds. Formally, if $R$ experiments use $R$ different $\mathcal{G}$,
$\mathcal{P}_G$ and $\mathcal{P}_N$ image sets, we extract the sorted union of the $R$ sets of
match scores. These thresholds will generally "oversample" each individual ROC.
Thus we have $R$ $(P_V, P_{FA})$ pairs at each threshold, which we plot with an error
ellipse that traces two standard errors in the $P_V$ and $P_{FA}$ dimensions. The
principal axes of this ellipse are the eigenvectors of the covariance matrix of
the $R$ pairs. This is shown in figure 3.
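
The ellipse construction can be sketched as below. This is one plausible
reading, not the plotting code behind figure 3: it interprets "two standard
errors" as twice the square root of each eigenvalue of the sample covariance of
the $R$ pairs, which is an assumption. The function names and the synthetic
Gaussian scores are hypothetical and stand in for real match and non-match
scores.

```python
import numpy as np

def rates_at(match_scores, nonmatch_scores, t):
    """(P_FA, P_V) of one experiment at the shared threshold t."""
    return np.mean(nonmatch_scores >= t), np.mean(match_scores >= t)

def error_ellipse(pairs, n_std=2.0):
    """Centre, principal axes (as columns) and semi-axis lengths for R pairs."""
    pairs = np.asarray(pairs, dtype=float)
    centre = pairs.mean(axis=0)
    cov = np.cov(pairs, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix, real eigenpairs
    half_lengths = n_std * np.sqrt(np.clip(eigvals, 0.0, None))
    return centre, eigvecs, half_lengths

# Synthetic stand-in for R experiments on disjoint populations.
rng = np.random.default_rng(1)
R = 5
experiments = [(rng.normal(2.0, 1.0, 200), rng.normal(0.0, 1.0, 2000))
               for _ in range(R)]

# Fixed global thresholds: sorted union of the match scores of all experiments.
thresholds = np.unique(np.concatenate([m for m, _ in experiments]))
t = thresholds[len(thresholds) // 2]      # one shared operating threshold

pairs = np.array([rates_at(m, n, t) for m, n in experiments])
centre, axes, half_lengths = error_ellipse(pairs)
print(centre, half_lengths)
```

Repeating the last three lines for every entry of `thresholds` gives one
ellipse per operating point, all evaluated at thresholds shared by the $R$
experiments.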

As a summary statistic we usually report the $P_V$ and $P_{FA}$ at a threshold that
gives $P_{FA} \approx 0.01$. We favor this over equal error rate (when
$P_{FA}(t) = 1 - P_V(t)$) because $P_{FA} = 0.01$ is an operationally realistic number,
while equal error rate not only varies, but is usually higher.
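
One way to locate such an operating point is sketched below. It is an
illustration only: it assumes candidate thresholds are restricted to observed
non-match scores and picks the smallest one whose false accept rate does not
exceed the target. The function name `operating_point` and the toy scores are
hypothetical.

```python
import numpy as np

def operating_point(match, nonmatch, target_fa=0.01):
    """Return (t, P_V, P_FA) at the lowest candidate threshold with P_FA <= target."""
    match = np.asarray(match, dtype=float)
    nonmatch = np.asarray(nonmatch, dtype=float)
    candidates = np.unique(nonmatch)      # sorted ascending
    fa = np.array([(nonmatch >= t).mean() for t in candidates])
    ok = candidates[fa <= target_fa + 1e-12]   # small tolerance for rounding
    t = ok[0] if ok.size else np.inf
    return t, (match >= t).mean(), (nonmatch >= t).mean()

# Toy scores: 100 non-match scores 0..99 and four match scores.
nonmatch = np.arange(100.0)
match = np.array([50.0, 99.0, 99.5, 120.0])
t, p_v, p_fa = operating_point(match, nonmatch)
print(t, p_v, p_fa)
```

The loop over candidates is quadratic in the number of non-match scores, which
is acceptable for a sketch but would be replaced by a sorted search for large
score sets.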

3 Conclusion 

The paper details the evaluation framework and scoring metrics used in the Face
Recognition Vendor Test 2002. We have outlined the concept of operational
realizability based on the real-world usage of systems with fixed operating
thresholds. We show how variation in verification performance must be computed
at the same threshold. Motivated by operational relevance we have defined true
imposters and shown via the non-match distribution that their use is necessary
in evaluations. We define the watch list scenario and show that verification and
identification are special cases of it.

Fig. 3. Verification variation. The figure shows the results of computing ROC curves
from 23 disjoint populations of size 800.